Abstract - We explore the application of a novel classification
method that combines supervised and unsupervised training, and
compare its performance to various more classical methods. We
first construct a detailed high dimensional representation of the
speech signal using Lyon’s cochlear model and then optimally
reduce its dimensionality. The resulting low dimensional projection
retains the information needed for robust speech recognition.


INTRODUCTION  -  SPEECH  PREPROCESSING  METHODS 

Many speech recognition systems, in particular those based on HMMs,
use LPC-derived cepstral coefficients as the first step in preprocessing the
speech data. These cepstra are then typically passed through vector
quantization (VQ), or used directly as input to the HMM. The VQ step
discretizes the multidimensional input vectors into a small set of possible
inputs. This helps simplify training the system, but also introduces varying
degrees of distortion [11]. This limitation is partially overcome by using
methods to estimate output parameters for the continuous space defined by
the cepstra. These techniques also run into problems when the dimensionality
of the input vector gets large. In spite of these potential problems,
LPC-based systems have performed well, especially when augmented with energy
and time-differenced cepstra [11].
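
The VQ step described above can be illustrated with a minimal sketch. It
assumes a Euclidean nearest-codeword rule and uses a hypothetical two-entry
codebook with made-up 2-D vectors; it is not the quantizer of any system cited
here. The returned distortion is the mean squared error introduced by
replacing each vector with its codeword.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each input vector to the index of its nearest codeword."""
    # Squared Euclidean distance between every vector and every codeword.
    diffs = vectors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    indices = np.argmin(dists, axis=1)
    # Mean squared error incurred by replacing vectors with codewords.
    distortion = float(np.mean(dists[np.arange(len(vectors)), indices]))
    return indices, distortion

# Hypothetical 2-D "cepstral" vectors and a 2-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
vectors = np.array([[0.1, 0.0], [0.9, 1.0], [0.5, 0.4]])
indices, distortion = quantize(vectors, codebook)
print(indices, distortion)
```

The third vector lies almost midway between the two codewords, so it
contributes most of the distortion; a larger codebook reduces this error at the
cost of more parameters to train.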

Speech recognition systems using ANNs have employed a much more
heterogeneous set of preprocessing techniques. Everything from raw speech to
LPC-based cepstra has been tried [12]. However, most have used some form
of preprocessing inspired by the representation produced by the mammalian
peripheral auditory system. Examples include Mel scale and Bark scale
spectra. Other more sophisticated techniques exist that produce more detailed
representations.

While there is a tendency for preprocessing based on auditory system
constraints to be used with ANNs and preprocessing based on vocal tract
constraints to be used with HMMs, this is not always the case. For instance,
some current HMM systems include a Mel scale transformation when computing
cepstra, and as mentioned above, LPC-based cepstra have been used with
ANNs. The differences in preprocessing for HMMs and ANNs can be largely
attributed to the fact that ANNs are good at integrating over high
dimensional representations, while HMMs do best with much lower dimensional
input.

In this paper we focus on ANN techniques for processing the detailed, high
dimensional auditory system representation of speech produced by Lyon’s
cochlear model [13]. We explore the application of a novel classification
method that combines supervised and unsupervised training, and compare
its performance to various other methods. Our task is feature extraction and
classification of voiceless stops extracted from the TIMIT corpus.

What  are  features  of  recognition  for  speech  data 

When moving to a much larger representation of the speech data, many
existing techniques such as classifiers or vector quantizers fail to work,
mainly because of the curse of dimensionality [1]. This problem is related to
the sparsity of high dimensional spaces, and implies that the amount of
training data has to grow exponentially with the dimensionality.
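
A small sketch makes the exponential growth concrete. It assumes a simple grid
argument: if each axis is cut into a fixed number of bins and at least one
training sample per cell is wanted, the number of cells (and hence samples)
is the bin count raised to the dimensionality. The function name and the
choice of 10 bins are illustrative only.

```python
def cells_needed(bins_per_axis, dims):
    """Number of grid cells when each of `dims` axes is cut into bins."""
    return bins_per_axis ** dims

# One sample per cell at 10 bins per axis, for growing dimensionality.
for dims in (1, 2, 3, 10):
    print(dims, cells_needed(10, dims))
```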

In many cases, it is reasonable to assume that the important information
for speech recognition lies in a much lower dimensional space, and
the question becomes how to find this low dimensional structure, or how to
extract the relevant features from the data. This question can be put in a
much broader statistical formulation, in which one has a data set that lies in
a high dimensional space with a lower dimensional structure, and tries to
reduce the dimensionality of the data without losing the important structure.
These problems may be addressed using a recent statistical tool called
Exploratory Projection Pursuit [3], which has an effective implementation with
a biologically motivated neural network [6].


LYON’S  MODEL  OF  COCHLEAR  PROCESSING 


We  chose  to  use  a  fairly  sophisticated  auditory  model  to  preprocess  the 
speech  data  for  our  neural  network.  One  reason  for  doing  this  was  to  assess 
the  feasibility  of  using  such  a  model  as  front  end  for  a  recognizer.  Auditory 
models  typically  produce  very  large  output  representations  in  order  to  retain 
much  of  the  detail  the  higher  centers  in  the  brain  receive  from  the  cochlea. 
The auditory model we used to preprocess the speech data was Lyon’s
cochlear model [13] as implemented by Slaney [18]. For each time slice, 84
channels of output were produced (for data sampled at 16 kHz). Time slices
were  separated  by  2  msecs.  Therefore,  for  56  msec  of  speech,  the  model 
produced  2352  bytes  of  data.  While  this  is  still  orders  of  magnitude  smaller 
than  what  is  transmitted  through  the  auditory  nerve  to  higher  centers  of  the 
brain,  it  is  much  larger  than  the  data  representations  typically  used  for  speech 
recognition  tasks. 
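
The size quoted above follows directly from the stated parameters, as the
following check shows. The constant and function names are illustrative; the
figures (84 channels, 2 msec slices, 56 msec window) are those given in the
text, with one value per channel per slice.

```python
CHANNELS = 84    # output channels per time slice
SLICE_MS = 2     # spacing between time slices, in msec
WINDOW_MS = 56   # duration of the speech segment, in msec

def cochleagram_size(window_ms, slice_ms=SLICE_MS, channels=CHANNELS):
    """Return (number of time slices, total number of output values)."""
    slices = window_ms // slice_ms
    return slices, slices * channels

slices, values = cochleagram_size(WINDOW_MS)
print(slices, values)
```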

The  channels  in  the  model  correspond  to  nerve  fibers  evenly  spaced  along 
the  basilar  membrane  in  the  cochlea.  The  center  frequencies  of  the  set  of 
channels  are  logarithmically  spaced,  giving  the  lower  frequencies  a  more  dense 
representation  than  the  higher  frequencies.  Neighboring  channels  overlap  to 
a  large  degree.  This  models  the  highly  redundant  representation  used  by  the 
mammalian  auditory  nerve.  The  band  pass  regions  of  the  channels  increase 
linearly  with  frequency. 
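
The channel layout described above can be sketched as follows. This is not
Lyon’s or Slaney’s parameterization: the frequency range (100 Hz to 8 kHz) and
the quality factor `q` are hypothetical values chosen only to show how
logarithmic spacing concentrates channels at low frequencies while bandwidth
grows linearly with center frequency.

```python
import numpy as np

def channel_layout(n_channels=84, f_low=100.0, f_high=8000.0, q=4.0):
    """Log-spaced center frequencies with bandwidth proportional to frequency.

    f_low, f_high and q are illustrative assumptions, not model parameters.
    """
    centers = np.geomspace(f_low, f_high, n_channels)
    bandwidths = centers / q   # linear growth of bandwidth with frequency
    return centers, bandwidths

centers, bandwidths = channel_layout()
midpoint = (centers[0] + centers[-1]) / 2
print("channels below the midpoint frequency:", int(np.sum(centers < midpoint)))
```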

Each channel is implemented as a second order digital filter. The entire
filter bank is implemented with a cascade design, giving the representation
realistic amplitude and group-delay response in addition to making the
computation efficient. To model the effects of the inner and outer ear, the
signal is passed through a pre-emphasis stage and then processed by the
cascade of second order filters. The final stage of processing is preceded by
half-wave rectification to model the unidirectional transduction of the
basilar membrane movement by the inner hair cells.

The final phase of the cochlear model passes the output of each channel
through a series of adaptive gain control (AGC) elements. These AGC
elements attempt to keep the output levels of each filter within a specific
range. Each AGC is coupled with its nearest neighbors to each side. This helps
model the masking effects found in real cochlear processing. The resulting
rectangular frequency by time representation forms an image of auditory
nerve activity and is called a cochleagram.

In sum, much of the detail and character of the representation used by
the auditory nerve is retained in the cochleagram representation. The task
then becomes how to best use all of this information.


FEATURE  EXTRACTION  IN  HIGH  DIMENSIONAL  SPACE  - 
THE  BCM  MODEL 

From a mathematical viewpoint, extracting features from the rectangular
representation of the cochleagram is related to dimensionality reduction
in high dimensional vector space, in which an n x k pixel image is considered
to be a vector of length n x k. In such high dimensional spaces the curse of
dimensionality [1] says that it is impractical to base the recognition on the
high dimensional vectors, because the number of training patterns needed for
training a classifier increases exponentially with the dimensionality.
Dimensionality reduction should therefore take place before attempting
the classification. Due to the large number of parameters involved, a feature
extraction method that uses the class labels of the data will be biased to the
training data [5], which translates to having features with poor generalization
or invariance properties. Thus, the feature extraction should be unsupervised.
A recent statistical method that addresses this problem of dimensionality
reduction, called exploratory projection pursuit (EPP) [3], assumes that
features can be constructed from projections of the input space onto a low
dimensional space. This method defines interesting features as those
projections whose single dimensional projected distribution is far from
Gaussian. Since high dimensional clusters translate to low dimensional
multi-modal projected distributions, a plausible measure of deviation from
normality can be based on a measure of multi-modality of the projected
distribution. Intrator [6] has recently shown that a variation of the
Bienenstock, Cooper and Munro (BCM) neuron [2] performs exploratory projection
pursuit using a projection index that measures multi-modality. A network
implementation which can find several projections in parallel is still
computationally efficient and therefore may be applicable for extracting
features from very high dimensional vector spaces of the type generated by the
cochlear model.
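
To give a feel for how a BCM neuron seeks multi-modal projections, the
following is a minimal sketch for a single linear neuron. The text does not
state the modification equations, so the sketch assumes a commonly cited form:
activity c = w.x, a sliding threshold theta = E[c^2], and a weight change
proportional to c (c - theta) x averaged over the data. The two-cluster toy
data, learning rate, step count and initial weights are all hypothetical, and
the code should not be read as the network of [6].

```python
import numpy as np

def bcm_train(x, w0, lr=0.1, steps=2000):
    """Batch BCM learning for one linear neuron (assumed form of the rule)."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        c = x @ w                    # neuron activity for every pattern
        theta = np.mean(c ** 2)      # sliding modification threshold, E[c^2]
        phi = c * (c - theta)        # BCM modification function
        w += lr * (phi[:, None] * x).mean(axis=0)
    return w

# Toy data: two tight clusters around orthogonal centers (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.repeat([0, 1], 200)
x = centers[labels] + 0.05 * rng.standard_normal((400, 2))

w = bcm_train(x, w0=[0.2, 0.1])
proj = x @ w
print("mean projection per cluster:",
      proj[labels == 0].mean(), proj[labels == 1].mean())
```

In this toy setting the neuron becomes selective: one cluster projects to a
large value and the other close to zero, so the projected distribution is
bimodal, which is the kind of structure the projection index rewards.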

The unsupervised feature extraction/classification method is presented in
Figure 1. Similar approaches using the RCE and back-propagation network
have been carried out by [15], and using the unsupervised charge clustering
network by Scofield [17]. Huang and Lippmann [4] described a feature-map
classifier for vowel recognition, in which internal nodes compute kernel
functions related to the Euclidean distance between the input and cluster
centers represented by these nodes. The unsupervised vector quantizer was
trained to form the new representation, which was then used to train the
supervised classifier. Kohonen et al. [10] used a similar approach with an LVQ
network. A review of various other unsupervised/supervised approaches appears
in [12].

Although unsupervised feature extraction has the potential of being less
biased to the training data, its result may be suboptimal since it ignores the
information contained in the class labels. It is possible, for example, that
not all the information required for the classification is contained in those
directions which are considered interesting by the feature extractor (some
trivial examples are discussed in [8]). Therefore, it is possible that a hybrid
unsupervised/supervised feature extractor may yield better performance.

Another way to look at the problem is from the classification side: the
performance of a classifier that reduces dimensionality based solely on the
class labels may be improved if an additional measure of the information
carried in the projections is added. In the case of a back-propagation
classification network, a local penalty term may be added to the energy
functional minimized by error back-propagation. This penalty, which is added
only to the hidden layer units, is the projection index defined by the BCM
network [6, 9]. Therefore, the modification equations for the hidden layer
units are affected by the delta rule [16] and by the BCM modification
equations. This
method is described in detail in [7].

Figure 1: Low dimensional classifier is trained on features extracted from the
high dimensional data. Training of the feature extraction network stops when
the misclassification rate drops below a predetermined threshold on either the
same training data (cross validatory test) or on different testing data.
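
The hybrid update for the hidden layer, in which the delta rule and the BCM
modification both act on the same weights, might be sketched as below. This is
an illustration under stated assumptions, not the method of [7]: it assumes
sigmoid units, a squared-error cost, a BCM term of the form c (c - theta) with
theta = E[c^2] computed on hidden activities, and hypothetical step sizes `lr`
and `lam`. Only the hidden weights `w1` receive the BCM term; the output
weights `w2` follow plain back-propagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hybrid_step(w1, w2, x, targets, lr=0.1, lam=0.05):
    """One batch update: delta rule everywhere, plus a BCM term on hidden units."""
    n = len(x)
    h = sigmoid(x @ w1)                      # hidden activities, (n, hidden)
    y = sigmoid(h @ w2)                      # network outputs, (n, classes)

    # Standard back-propagation of the squared error.
    delta_out = (y - targets) * y * (1.0 - y)
    delta_hid = (delta_out @ w2.T) * h * (1.0 - h)
    grad_w2 = h.T @ delta_out / n
    grad_w1 = x.T @ delta_hid / n

    # Assumed BCM term per hidden unit: phi = c (c - theta), theta = E[c^2].
    theta = np.mean(h ** 2, axis=0)
    phi = h * (h - theta)
    bcm_w1 = x.T @ (phi * h * (1.0 - h)) / n

    new_w1 = w1 - lr * grad_w1 + lam * bcm_w1   # supervised + unsupervised
    new_w2 = w2 - lr * grad_w2                  # supervised only
    return new_w1, new_w2

# Hypothetical sizes: 6 inputs, 4 hidden units, 3 classes, 20 patterns.
rng = np.random.default_rng(1)
x = rng.standard_normal((20, 6))
targets = np.eye(3)[rng.integers(0, 3, size=20)]
w1 = 0.1 * rng.standard_normal((6, 4))
w2 = 0.1 * rng.standard_normal((4, 3))

w1_hybrid, w2_hybrid = hybrid_step(w1, w2, x, targets)
w1_plain, w2_plain = hybrid_step(w1, w2, x, targets, lam=0.0)
```

Setting `lam=0.0` recovers plain back-propagation, which makes it easy to
compare the two training regimes from identical starting weights.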

METHODS 

Data  -  Voiceless  Stops  from  TIMIT 

In  this  work  we  focused  on  feature  extraction  and  classification  of  the 
voiceless  stop  consonants  [p,  t,  k].  The  source  of  our  data  was  the  DARPA 
TIMIT  Acoustic-Phonetic  Continuous  Speech  Corpus  (TIMIT).  This  database 
contains  utterances  from  many  talkers,  with  coverage  of  all  the  major  dialect 
regions  in  the  United  States. 

All tokens used in these experiments consisted of a stop followed by a
vowel. We used only four vowel contexts [aa, ao, er, iy] in the training set.
These vowels give a reasonable, but not complete, coverage of the vowel space.
This restricted set allowed us to test how well the feature extraction
generalized to new vowel contexts.

These  tokens  were  drawn  from  the  utterances  of  268  different  talkers. 
Multiple  talkers  and  various  sentential  contexts  contribute  to  a  fair  degree 
of  variability  between  tokens  of  the  same  CV  type.  The  segment  boundaries 
we  used  were  exactly  those  provided  with  TIMIT.  We  made  no  attempt  to 
sharpen  or  correct  any  misalignments  that  might  exist  in  the  data. 

For each CV type, an average over the 25 tokens used for training is
presented in the cochleagram matrix shown in Figure 2. The vertical axis is
frequency, low to high from top to bottom, and the horizontal axis is time for
each cochleagram. Looking at the lower left corner of the images, it can be
seen that [p]s have low energy at the high frequencies, [t]s have a sharp burst
in the high frequencies, and [k]s have diffuse energy in the high frequencies.
These features tend to distinguish between the three voiceless stops for the
cochleagram representation.


Figure 2: The output of Lyon’s cochlear model for the 12 CV pairs. From
top to bottom [k, t, p], and from left to right [aa, ao, er, iy]. Each image is
the average of 25 tokens from each CV type showing 75 msec of speech aligned
to burst release. White areas represent high energy.


Training 

In the first experiment, features were extracted from the large
representation of the speech segment using a BCM network. Here the BCM weights
were only affected by the unsupervised modification rule. Classification was
accomplished by training a small back-propagation network with the output
of the BCM network, as shown in Figure 1. The important issue of avoiding
overfitting (in either of the nets) was addressed by testing during training
on a third set of tokens (the pseudo test set).
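
The stopping procedure can be sketched as a simple loop that monitors the
misclassification rate on the held-out pseudo test set and stops once it drops
below a predetermined threshold. The function and argument names are
hypothetical, as are the threshold and the simulated error sequence; the two
callables stand in for one epoch of network training and one evaluation on the
pseudo test set.

```python
def train_until_threshold(train_epoch, held_out_error, threshold, max_epochs=1000):
    """Train until the held-out misclassification rate falls below threshold.

    Returns (epochs run, last held-out error); error is None if max_epochs < 1.
    """
    error = None
    for epoch in range(1, max_epochs + 1):
        train_epoch()                # one pass over the training tokens
        error = held_out_error()     # misclassification rate on pseudo test set
        if error < threshold:
            return epoch, error
    return max_epochs, error

# Simulated run: the held-out error falls by 0.1 per epoch (illustrative only).
simulated_errors = iter([0.5, 0.4, 0.3, 0.2, 0.1])
epoch, error = train_until_threshold(
    train_epoch=lambda: None,
    held_out_error=lambda: next(simulated_errors),
    threshold=0.25,
)
print(epoch, error)
```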

In the second experiment, the modification of the hidden units of a 3 layer
back-propagation network was a combination of the BCM synaptic modification
equations and the error propagated from the top layer. The performance
of the networks in the first and second experiments was compared to the
performance of a simple back-propagation network.

Training Method     4 Vowels     4 Vowels     7 Vowels
                    Training     Testing      Testing

BCM                    --         73.8%        72.7%
BCM/B-P                --         83.8%        81.5%
B-P                   98.7%       84.8%        78.2%

Table  1:  Comparison  between  classification  using  (1)  projections  from  BCM 
unsupervised  learning  as  input  to  back-propagation;  (2)  a  hybrid  of  BCM 
unsupervised  learning  and  supervised  learning  via  error  back-propagation; 
and  (3)  a  plain  back-propagation  net. 

We  used  two  generalization  paradigms  to  test  the  feature  extraction  and 
classification  ability  of  the  system.  First,  the  standard  type  of  generalization 
to  new  instances  of  the  same  class  was  carried  out.  For  each  of  the  12  CV 
types,  we  tested  with  25  novel  instances*.  This  kind  of  generalization  requires 
the  system  to  categorize  instances  that  fall  within  the  region  of  the  input  space 
it  has  had  experience  with.  Many  recognition  systems  are  specifically  focused 
on  this  kind  of  generalization.  However,  the  second  kind  of  generalization, 
where  a  system  trained  with  a  limited  set  of  contexts  generalizes  well  in 
new  contexts,  is  possibly  more  important.  If  a  system  can  transfer  to  new 
contexts,  or  to  a  region  of  the  input  space  it  has  not  experienced,  the  set  of 
abstract  features  it  is  using  must  be  capturing  highly  relevant  aspects  of  the 
input  training  space.  The  ability  to  discover  such  features  strongly  suggests 
the  technique  being  used  is  well  suited  for  robust  speech  recognition.  We 
demonstrate  this  kind  of  generalization  by  training  on  four  vowel  contexts 
[aa,  ao,  er,  iy],  and  testing  with  the  seven  vowel  contexts  [uh,  ih,  eh,  ae,  ah, 
uw,  ow]. 
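
The two generalization paradigms amount to two different test splits, which
can be sketched as follows. The vowel and stop labels are those given in the
text; the function name and the example tokens are hypothetical, and a token is
represented here simply as a (stop, vowel) pair.

```python
STOPS = ("p", "t", "k")
TRAIN_VOWELS = ("aa", "ao", "er", "iy")
NEW_VOWELS = ("uh", "ih", "eh", "ae", "ah", "uw", "ow")

# The 12 CV types seen in training: every stop paired with every training vowel.
CV_TYPES = [stop + vowel for stop in STOPS for vowel in TRAIN_VOWELS]

def split_by_context(tokens):
    """Split (stop, vowel) test tokens into same-context and new-context sets."""
    same = [t for t in tokens if t[1] in TRAIN_VOWELS]
    new = [t for t in tokens if t[1] in NEW_VOWELS]
    return same, new

# Hypothetical test tokens.
tokens = [("p", "aa"), ("t", "uh"), ("k", "iy"), ("p", "ow")]
same_context, new_context = split_by_context(tokens)
print(len(CV_TYPES), same_context, new_context)
```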

RESULTS  AND  DISCUSSION 

A comparison between the different training methods is shown in Table 1.
The low dimensional projections of the cochleagrams discovered with BCM
learning served as input to a small back-propagation network to yield the first
set of results. This training method yielded reasonable performance on the
training set, and very nearly the same performance on the two test sets. The
small difference between generalization to the same-4-vowel-contexts
test set and generalization to the new-7-vowel-contexts test
set implies the features discovered with this method are good abstractions,
and robust. The weight matrices of the eight units used in the BCM network
are shown in Figure 3.

Features distinguishing between the different bursts are evident. The
synaptic weight image on the top row, furthest to the right, shows a white area
in the high frequencies which corresponds to a distinguishing feature between
[t] and [k]. The image directly below is useful for distinguishing [p] from the
other two stops.

* There were only 21 new tokens available for [pao]. All other CV groups had 25 tokens.


Figure  3:  The  synaptic  weight  matrices  for  8  units  after  unsupervised  training 
on  25  tokens  of  each  CV  type. 

The results of the second training method, in which error back-propagation
was modified to incorporate BCM-like constraints, are shown on the second
line of Table 1. This novel integration of supervised and unsupervised
techniques boosted the performance significantly over the previous training
method. However, the pattern of results is very much the same: good and
nearly equal performance with both types of generalization.

In contrast, this pattern was not found with the plain back-propagation
net. While it did achieve the best performance of the three networks on the
training set, it did not transfer its good generalization performance on the
same-4-vowel-contexts test set to the new-7-vowel-contexts test set. Straight
back-propagation training only attempts to minimize errors with the training
set. It does not necessarily search for abstract features.
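
The contrast between the three methods can be summarized as the drop in
accuracy from the same-context test set to the new-context test set. The
sketch below reads the last two figures of each row of Table 1 as those two
test scores; for the two BCM-based rows, where only two figures are given,
this reading is an assumption based on the discussion above.

```python
# (same-context test %, new-context test %), read from Table 1 under the
# assumption stated in the lead-in.
SCORES = {
    "BCM": (73.8, 72.7),
    "BCM/B-P": (83.8, 81.5),
    "B-P": (84.8, 78.2),
}

def generalization_gap(same_context, new_context):
    """Drop in percentage points when moving to unseen vowel contexts."""
    return round(same_context - new_context, 1)

gaps = {name: generalization_gap(*pair) for name, pair in SCORES.items()}
print(gaps)
```

Under this reading, the plain back-propagation net loses several times more
accuracy on new vowel contexts than either BCM-based method.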

At  this  point,  the  only  comparison  we  can  make  with  HMM  performance 
is  very  loose.  Niles  [14]  constructed  a  baseline  HMM  system  to  classify  the 
standard  set  of  39  phonetic  classes  in  TIMIT.  The  speech  was  preprocessed 
using  an  order-18  LPC  cepstral  analysis,  and  then  VQ  codebooks  for  the 
cepstra,  time-differenced  cepstra,  log  energy,  and  delta  log  energy  were  used 
as input. A three-state HMM was trained for each phoneme. This system
classified 82.0 percent of tokens correctly when tested with just the voiceless stops. While
this  does  give  a  ballpark  indication  that  the  systems  we  investigated  here  are 
doing  reasonably  well,  any  further  comparison  is  precluded  by  methodological 
differences.  For  instance,  Niles  trained  the  HMMs  for  voiceless  stops  with  all 
phonetic  contexts,  while  our  tokens  always  had  a  following  vowel.  Also,  the 
HMM  system  was  used  as  a  baseline  system,  and  was  not  fine  tuned. 

These preliminary results suggest that BCM training can be beneficially
incorporated into a network architecture/training paradigm for speech
recognition. Moreover, the cochleagram input representation produced by Lyon’s
cochlear model contains details about the speech events that are useful in
classifying speech tokens. A set of experiments making specific, quantitative
comparisons between the system we have proposed here and current HMM
methods is planned.